First, we need a graph. A graph is just a bunch of objects, typically called nodes which are connected to each other. The connections are called edges, and can have a direction, like Tom owes money to Andy, or can be non-directional, like a road connecting two cities.
A graph is a collection of nodes and edges. The BFS algo can tell us if two nodes are connected, and finds the shortest path b/w them.
In [226]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from geopy.distance import great_circle
from collections import deque
To make the graph interesting, I am using the airlines and route information database from openflights.org.
So we have two datasets, one about airports, and one about routes b/w airports. For the purpose of the graph, the airports are the nodes, and the routes are the edges b/w them. The routes have a direction, and the cost or weight of the route is the distance.
In [500]:
cols = ["Airport ID", "Name", "City", "Country", "IATA", "ICAO", "Latitude", "Longitude", "Altitude",
"Timezone", "DST", "Tz database", "Type", "Source"]
airports = pd.read_csv("data/airports.dat", header=None, names=cols, index_col=0)
print(f"There are {airports.shape[0]} airports")
airports.head()
Out[500]:
Step one is is to get rid of all the info we don't need for our graph.
So for Airports, I am just keeping:
In [501]:
# dropping the columns we don't need
keep_cols = ["City", "Country", "Latitude", "Longitude"]
airports = airports[keep_cols]
print(f"There are {airports.shape[0]} airports")
airports.head()
Out[501]:
In [502]:
# getting rid of null values
airports = airports.dropna(axis=0)
airports.shape
Out[502]:
In [331]:
cols = ["Airline", "Airline ID", "Source airport", "Source Airport ID", "Destination airport",
"Dest Airport ID", "Codeshare", "Stops", "Equipment"]
routes = pd.read_csv("data/routes.dat", header=None, names=cols)
print(f"There are {routes.shape[0]} routes")
routes.head()
Out[331]:
We just need:
In [332]:
# first drop all rows where stops aren't 0, as we only want direct connections
routes = routes[routes["Stops"] == 0]
keep_cols = ["Source Airport ID", "Dest Airport ID"]
routes = routes[keep_cols]
print(f"There are {routes.shape[0]} routes")
routes.head(10)
Out[332]:
Now there are some values which aren't numbers, so we need to clean them up, as you can see in row 7 above.
In [333]:
def make_int_or_null(x):
"""returns int or np.nan if can't return int"""
try:
return int(x)
except:
return np.nan
routes = routes.applymap(make_int_or_null)
In [334]:
print(f"There are {routes.shape[0]} routes before dropping null values")
routes = routes.dropna(axis=0)
print(f"there are {routes.shape[0]} after dropping null values")
In [335]:
routes.head()
Out[335]:
Now we need to be able to get the distance b/w two airports to be able to assign a weight in our graphs.
In [368]:
def route_distance(edge):
"""takes a route as a pandas row, and returns distance b/w them"""
src_airport = airports.iloc[int(edge["Source Airport ID"])]
dest_airport = airports.iloc[int(edge["Dest Airport ID"])]
src_lat = src_airport["Latitude"]
src_long = src_airport["Longitude"]
dest_lat = dest_airport["Latitude"]
dest_long = dest_airport["Longitude"]
src_loc = (float(src_lat), float(src_long))
dest_loc = (float(dest_lat), float(dest_long))
return great_circle(src_loc, dest_loc).km
#checking algo
print(route_distance(routes.iloc[100]))
so the data seems ready to go. We now have two pandas dataframes, airports and routes, and a function which returns the distance b/w airports.
there are many ways to make a graph. One way is to build out all the nodes and connections. So here I'll initiate the graph as a dictionary of nodes, with the data being the edges.
A blank graph where the key is each airport:
In [513]:
graph = {}
for i, row in airports.iterrows():
graph[i] = []
Now going through each route and adding (dest_airport, distance) to each src_airport in the graph:
In [514]:
from tqdm import tqdm
for i, row in tqdm(routes.iterrows(), total=len(routes), mininterval=0.5):
try:
src_airport = airports.iloc[int(row["Source Airport ID"])]
dest_airport = airports.iloc[int(row["Dest Airport ID"])]
except:
pass
dist = great_circle((src_airport["Latitude"], src_airport["Longitude"]),
(dest_airport["Latitude"],dest_airport["Longitude"]) ).km
n = int(row["Source Airport ID"])
d = int(row["Dest Airport ID"])
if n in graph.keys():
graph[n].append((d, int(dist)))
Now say I want to find out the graph for Karachi. Karachi has three airports, so only looking at the first one for now...
In [515]:
find = airports["City"] == "Karachi"
airports[find]
Out[515]:
So I can see all the cities it's connected to, though since some cities have multiple airports they show up twice. So how do I deal with that? I can ignore multiple airports as the distance is the same, though a more complex graph would take into account the different $$ value.
In [543]:
for i, t in enumerate(graph[2206]):
if t[0] in graph.keys():
print(i, airports.loc[t[0]]["City"], t[1])
else:
continue
In [525]:
airports.shape
Out[525]:
In [ ]: